DATA SCIENCE SESSIONS VOL. 3¶

A Foundational Python Data Science Course¶

Session 16: Generalized Linear Models I. Binomial Logistic Regression and its MLE. Multinomial Regression.¶

← Back to course webpage

Feedback should be sent to goran.milovanovic@datakolektiv.com.

These notebooks accompany the DATA SCIENCE SESSIONS VOL. 3 :: A Foundational Python Data Science Course.

Lecturers¶

Goran S. Milovanović, PhD, DataKolektiv, Chief Scientist & Owner

Aleksandar Cvetković, PhD, DataKolektiv, Consultant

Ilija Lazarević, MA, DataKolektiv, Consultant


Generalized Linear Models¶

We will now begin to introduce a set of extremely useful statistical learning models. Their name - Generalized Linear Models (GLMs) - suggests that they are somehow related to the Simple and Multiple Linear Regression models and yet go beyond them. That is correct: GLMs generalize the linear model, in which the predictors and their respective coefficients form a linear combination, by introducing link functions that let us solve problems Linear Regression cannot handle. For example, what if the problem is not to predict a continuous value of the criterion, but the outcome variable is a dichotomy, so that the problem becomes one of categorization - e.g. predicting the sex of a respondent given a set $\boldsymbol{X}$ of their features? Enter Binomial Logistic Regression, the simplest GLM.

Another thing: GLMs cannot be estimated by minimizing the quadratic error, as we estimated Simple and Multiple Linear Regression in the previous Session 15. The method used to estimate Linear Models is known as Least Squares Estimation. To fit GLMs to our data, we will introduce the concept of Likelihood from Probability Theory and learn about the Maximum Likelihood Estimate!

What happens when the assumptions of the Linear Model fail?¶

Let us briefly recall the assumptions of the (Multiple) Linear Regression model:

  • Variables are real numbers: both the outcome and the predictor variables are members of $R$, the set of real numbers; at least in theory, they can take any real value from $-\infty$ to $+\infty$.
  • Linearity: there must be a linear relationship between the outcome variable and the predictor variable(s).
  • Normality: it is assumed that the residuals (i.e. model errors) are normally distributed.
  • Homoscedasticity: the variances of the error terms (i.e. residuals) are similar across the values of the predictor variables.
  • No autocorrelation: the residuals are not autocorrelated.
  • No influential cases: no outliers are present.
  • No Multicollinearity (in Multiple Regression only): the predictors are not too highly correlated with each other.
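The last assumption can be checked with the variance inflation factor (VIF), which we have already imported above. Here is a minimal sketch on synthetic data (the column names `x1`, `x2`, `x3` are illustrative, not from our dataset); a common rule of thumb treats VIF values above 5 or 10 as problematic:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + rng.normal(scale=0.2, size=500)  # strongly correlated with x1
x3 = rng.normal(size=500)                        # independent predictor
X = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})
X.insert(0, 'const', 1.0)                        # VIF expects an intercept column

vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != 'const'}
print(vifs)  # x1 and x2 get large VIFs; x3 stays close to 1
```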

What if we observe a set of variables that describe a statistical experiment which can result in either of two discrete outcomes? For example, we observe a description of a person's behavior, quantified in some way and organized into a set of variables, and we should use it to predict the sex of that person. Or any other similar problem where the outcome can take only two values, say 0 or 1 (immediately recall the Binomial Distribution)?

The assumptions of the Linear Model obviously constrain its application in such cases. We now ask the following question: would it be possible to generalize - expand or modify - the Linear Model so that it can encompass the categorization problem? It certainly sounds appealing: take a set of predictors, combine them in a linear fashion, and estimate the coefficients so as to be able to predict whether the outcome turns out one way or the other.

There is a way to develop such a generalization of the Linear Model. In its simplest form it is the Binomial Logistic Regression. Binomial Logistic Regression is very similar to multiple regression, except that the outcome variable is now categorical (i.e. it is measured on a nominal scale that is a dichotomy).

In [ ]:
### --- Setup - importing the libraries

# - suppress those annoying 'FutureWarning' messages
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# - data
import numpy as np
import pandas as pd

# - os
import os

# - ml
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

# - visualization
import matplotlib.pyplot as plt
import seaborn as sns

# - parameters
%matplotlib inline
pd.options.mode.chained_assignment = None  # default='warn'
sns.set_theme()
# - rng
rng = np.random.default_rng()
# - plots
plt.rc("figure", figsize=(8, 6))
plt.rc("font", size=14)
sns.set_theme(style='white')

# - directory tree
data_dir = os.path.join(os.getcwd(), '_data')

1. Binomial Logistic Regression¶

Let's recall the form of the Linear Model with any number of predictors:

$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_kX_k + \epsilon$$

So we have a linear combination of $k$ predictors $\boldsymbol{X}$ plus the model error term $\epsilon$ on the RHS, and the outcome variable $Y$ on the LHS.

Now we assume that $Y$ can take only two possible values, call them $0$ and $1$ for ease of discussion. We want to predict whether $Y$ will happen to be ($1$) or not ($0$) given our observations of a set of predictors $\boldsymbol{X}$. However, in Binary Logistic Regression we do not predict the value of the outcome itself, but rather the probability that the outcome will turn out $1$ or $0$ given the predictors.

In the simplest possible case, where there is only one predictor $X_1$, this is exactly what we predict in Binary Logistic Regression:

$$P(Y) = p_1 = \frac{1}{1+e^{-(\beta_0 + \beta_1X_1)}}$$

where $\beta_0$ and $\beta_1$ are the same good old linear model coefficients. As we will see, the linear coefficients receive a new interpretation in Binary Logistic Regression - one rather different from the one they have in the scope of the Linear Model.

With $k$ predictors we have:

$$P(Y) = p_1 = \frac{1}{1+e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_kX_k)}}$$

Now, the above equation looks as if it fell from the clear blue sky to solve the problem. There is a clear motivation for its form, of course: imagine that instead of predicting the state of $Y$ directly we decide to predict the odds of $Y$ turning out $1$ instead of $0$:

$$odds = \frac{p_1}{1-p_1}$$

Now for the trick: if instead of predicting the odds $p_1/(1-p_1)$ we decide to predict the log-odds (also called the logit) from a linear combination of predictors

$$log \left( \frac{p_1}{1-p_1} \right) = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_kX_k$$

it turns out that we can recover the odds by taking the exponent of both LHS and RHS:

$$\frac{p_1}{1-p_1} = e^{(\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_kX_k)}$$

and then by following a simple algebraic rearrangement:

(1) let's write $l=\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_kX_k$ for simplicity and use $l$ to replace the whole linear combination in what follows;

(2) we have $log \left( \frac{p_1}{1-p_1} \right)=l$ then;

(3) taking the exponent of both sides we arrive at $\frac{p_1}{1-p_1}=e^l$;

(4) immediately follows that

$\frac{p_1}{1-p_1}=\frac{1}{e^{-l}} \implies p_1=\frac{1-p_1}{e^{-l}} \implies p_1+\frac{p_1}{e^{-l}}=\frac{1}{e^{-l}}\implies p_1e^{-l}+p_1=1 \implies p_1(1+e^{-l})=1$

and after rewriting $l$ as a linear combination again we find that the probability $p_1$ of the outcome $Y$ turning out $1$ is:

$$P(Y) = p_1 = \frac{1}{1+e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_kX_k)}}$$

Now, imagine we set the following criterion: anytime we estimate $p_1$ to be larger than or equal to $.5$ we predict that $Y=1$, and anytime $p_1 < .5$ we predict that $Y=0$. What we need to do in order to learn how to predict $Y$ in this way is to estimate the coefficients $\beta_0$, $\beta_1$, $\beta_2$, etc., as we did in the case of a linear model. However, minimizing the SSE will not work in this case: our predictions are on a probability scale, while our observations are discrete, $0$ or $1$. We will have to find another way.
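The whole pipeline above - linear predictor, logistic squashing, decision criterion - can be sketched in a few lines (the coefficients below are arbitrary, chosen for illustration only). Note that taking the log-odds of the resulting probability recovers the linear predictor exactly, which is precisely the inversion we just derived:

```python
import numpy as np

def logistic(l):
    """Map the linear predictor l to a probability p1 in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-l))

beta0, beta1 = -1.0, 0.5            # illustrative coefficients, not fitted ones
x1 = np.array([-2.0, 0.0, 2.0, 4.0])
l = beta0 + beta1 * x1              # linear predictor
p1 = logistic(l)                    # P(Y = 1 | x1)
y_hat = (p1 >= 0.5).astype(int)     # decision criterion at .5

# the log-odds of p1 recover the linear predictor
assert np.allclose(np.log(p1 / (1 - p1)), l)
print(p1, y_hat)                    # y_hat turns out [0 0 1 1]
```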

In [ ]:
# - loading the dataset
# - GitHub: https://github.com/dijendersaini/Customer-Churn-Model/blob/master/churn_data.csv
# - place it in your _data/ directory
churn_data = pd.read_csv(os.path.join(data_dir, 'churn_data.csv'))
churn_data.head(10)
Out[ ]:
customerID tenure PhoneService Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG 1 No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE 34 Yes One year No Mailed check 56.95 1889.5 No
2 3668-QPYBK 2 Yes Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW 45 No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU 2 Yes Month-to-month Yes Electronic check 70.70 151.65 Yes
5 9305-CDSKC 8 Yes Month-to-month Yes Electronic check 99.65 820.5 Yes
6 1452-KIOVK 22 Yes Month-to-month Yes Credit card (automatic) 89.10 1949.4 No
7 6713-OKOMC 10 No Month-to-month No Mailed check 29.75 301.9 No
8 7892-POOKP 28 Yes Month-to-month Yes Electronic check 104.80 3046.05 Yes
9 6388-TABGU 62 Yes One year No Bank transfer (automatic) 56.15 3487.95 No

Data Preparation¶

In [ ]:
churn_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   tenure            7043 non-null   int64  
 2   PhoneService      7043 non-null   object 
 3   Contract          7043 non-null   object 
 4   PaperlessBilling  7043 non-null   object 
 5   PaymentMethod     7043 non-null   object 
 6   MonthlyCharges    7043 non-null   float64
 7   TotalCharges      7043 non-null   object 
 8   Churn             7043 non-null   object 
dtypes: float64(1), int64(1), object(7)
memory usage: 495.3+ KB
In [ ]:
# - some entries have missing values given as empty strings
churn_data.loc[488]
Out[ ]:
customerID                         4472-LVYGI
tenure                                      0
PhoneService                               No
Contract                             Two year
PaperlessBilling                          Yes
PaymentMethod       Bank transfer (automatic)
MonthlyCharges                          52.55
TotalCharges                                 
Churn                                      No
Name: 488, dtype: object
In [ ]:
# - use .replace method to replace empty strings with NaN values
churn_data = churn_data.replace(r'^\s*$', np.nan, regex=True)
churn_data.loc[488]
Out[ ]:
customerID                         4472-LVYGI
tenure                                      0
PhoneService                               No
Contract                             Two year
PaperlessBilling                          Yes
PaymentMethod       Bank transfer (automatic)
MonthlyCharges                          52.55
TotalCharges                              NaN
Churn                                      No
Name: 488, dtype: object
In [ ]:
# - we drop all the entries with missing values
churn_data = churn_data.dropna().reset_index(drop=True)
In [ ]:
churn_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7032 entries, 0 to 7031
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7032 non-null   object 
 1   tenure            7032 non-null   int64  
 2   PhoneService      7032 non-null   object 
 3   Contract          7032 non-null   object 
 4   PaperlessBilling  7032 non-null   object 
 5   PaymentMethod     7032 non-null   object 
 6   MonthlyCharges    7032 non-null   float64
 7   TotalCharges      7032 non-null   object 
 8   Churn             7032 non-null   object 
dtypes: float64(1), int64(1), object(7)
memory usage: 494.6+ KB
In [ ]:
# - notice that 'TotalCharges' values are non-numeric type, but they should be
# - this is due to the empty string values that were previously present
# - we convert them to numeric type
churn_data['TotalCharges'] = churn_data['TotalCharges'].astype('float')
churn_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7032 entries, 0 to 7031
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7032 non-null   object 
 1   tenure            7032 non-null   int64  
 2   PhoneService      7032 non-null   object 
 3   Contract          7032 non-null   object 
 4   PaperlessBilling  7032 non-null   object 
 5   PaymentMethod     7032 non-null   object 
 6   MonthlyCharges    7032 non-null   float64
 7   TotalCharges      7032 non-null   float64
 8   Churn             7032 non-null   object 
dtypes: float64(2), int64(1), object(6)
memory usage: 494.6+ KB

Target: Predict churn from all numeric predictors¶

We use Binomial Logistic Regression to predict the probability $p$ of a given observation $\textbf{x}$, with features $(x_1, x_2, \ldots, x_k)$, belonging to one of the two binary categories $\{0, 1\}$. We compute these probabilities via

$$p = \frac{1}{1+e^{-(\beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_kx_k)}},$$

where $\beta_1, \beta_2, \ldots, \beta_k$ are the model's parameters for the predictors, and $\beta_0$ is the intercept of the model.

In order to determine whether the predicted label $\hat{y}$ for a given observation $\textbf{x}$ is 1 or 0, we impose a decision criterion $\sigma$ - a number in the $(0, 1)$ interval. If $p > \sigma$, then we assign the label $\hat{y} = 1$ to the observation $\textbf{x}$; else, we take $\hat{y} = 0$. Usually, we take $\sigma = 0.5$.

The model is optimized by MLE (Maximum Likelihood Estimation), and the interpretation of the model coefficients is the following:

  • for a given predictor $x_i$, the exponential of its coefficient, $e^{\beta_i}$, tells us about the change $\Delta_{odds}$, where $\Delta_{odds}$ is the ratio of the odds $\frac{p_1}{1-p_1}$ after a unit increase in $x_i$ to the odds before it - given that everything else is kept constant.
In [ ]:
### --- Preparing the model frame

# - extracting 'Churn' and all the numerical features columns
model_frame = churn_data[['Churn', 'tenure', 'MonthlyCharges', 'TotalCharges']]
model_frame.head()
Out[ ]:
Churn tenure MonthlyCharges TotalCharges
0 No 1 29.85 29.85
1 No 34 56.95 1889.50
2 Yes 2 53.85 108.15
3 No 45 42.30 1840.75
4 Yes 2 70.70 151.65
In [ ]:
# - encoding 'Churn' values to binary values
model_frame['Churn'] = model_frame['Churn'].apply(lambda x: int(x == 'Yes'))
model_frame.head()
Out[ ]:
Churn tenure MonthlyCharges TotalCharges
0 0 1 29.85 29.85
1 0 34 56.95 1889.50
2 1 2 53.85 108.15
3 0 45 42.30 1840.75
4 1 2 70.70 151.65
In [ ]:
predictors = model_frame.columns.drop('Churn')
predictors
Out[ ]:
Index(['tenure', 'MonthlyCharges', 'TotalCharges'], dtype='object')
In [ ]:
# --- Composing the formula of the model

# - right side of the formula
formula = ' + '.join(predictors)

# - left side of the formula
formula = 'Churn ~ ' + formula

formula
Out[ ]:
'Churn ~ tenure + MonthlyCharges + TotalCharges'
In [ ]:
# - fitting BLR model to the data
binomial_linear_model = smf.logit(formula=formula, data=model_frame).fit()
Optimization terminated successfully.
         Current function value: 0.453372
         Iterations 7
In [ ]:
binomial_linear_model.summary()
Out[ ]:
Logit Regression Results
Dep. Variable: Churn No. Observations: 7032
Model: Logit Df Residuals: 7028
Method: MLE Df Model: 3
Date: Wed, 24 May 2023 Pseudo R-squ.: 0.2170
Time: 17:51:08 Log-Likelihood: -3188.1
converged: True LL-Null: -4071.7
Covariance Type: nonrobust LLR p-value: 0.000
coef std err z P>|z| [0.025 0.975]
Intercept -1.5988 0.117 -13.628 0.000 -1.829 -1.369
tenure -0.0671 0.005 -12.297 0.000 -0.078 -0.056
MonthlyCharges 0.0302 0.002 17.585 0.000 0.027 0.034
TotalCharges 0.0001 6.14e-05 2.361 0.018 2.47e-05 0.000

N.B. Use Wald's z statistic with caution:

The reason why the Wald statistic should be used cautiously is because, when the regression coefficient is large, the standard error tends to become inflated, resulting in the Wald statistic being underestimated (see Menard, 1995). The inflation of the standard error increases the probability of rejecting a predictor as being significant when in reality it is making a significant contribution to the model (i.e. you are more likely to make a Type II error).

From: Andy Field, DISCOVERING STATISTICS USING SPSS, Third Edition, Sage.

In [ ]:
# - model's parameters
binomial_linear_model.params
Out[ ]:
Intercept        -1.598827
tenure           -0.067114
MonthlyCharges    0.030200
TotalCharges      0.000145
dtype: float64
In [ ]:
# - exponential of the model's parameters
np.exp(binomial_linear_model.params)
Out[ ]:
Intercept         0.202133
tenure            0.935088
MonthlyCharges    1.030660
TotalCharges      1.000145
dtype: float64

Coefficients in BLR¶

The $\Delta Odds$ (Odds Ratio)

Do not forget that you have transformed your linear combination of model coefficients and predictors into log-odds space: the logistic regression coefficient $\beta_i$ associated with a predictor $X_i$ is the expected change in log(odds) following a unit increase in $X_i$.

So, by taking $e^{\beta_i}$, your coefficient now says the following:

  • take the odds after a unit increase in the predictor $X_i$;
  • take the original odds (before the unit increase in $X_i$);
  • their ratio, $\Delta Odds$ = (odds after a unit increase in the predictor)/(original odds), equals $e^{\beta_i}$.

Which means that

  • if $e^{\beta_i}>1$, then predictor $X_i$ increases the odds of outcome vs no outcome, while
  • if $e^{\beta_i}<1$, then predictor $X_i$ decreases the odds of outcome vs no outcome.
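We can verify this interpretation numerically. A small sketch using the Intercept and tenure coefficients reported in the model summary above (the other predictors are left out for simplicity): the ratio of the odds after and before a unit increase in tenure equals $e^{\beta_{tenure}}$ exactly.

```python
import numpy as np

# - Intercept and tenure coefficients from the fitted model above
beta0, beta1 = -1.598827, -0.067114

def p1(x):
    # - probability of churn as a function of tenure alone
    return 1 / (1 + np.exp(-(beta0 + beta1 * x)))

def odds(x):
    return p1(x) / (1 - p1(x))

x = 10
delta_odds = odds(x + 1) / odds(x)  # odds ratio for a unit increase in tenure
print(delta_odds, np.exp(beta1))    # both ~0.9351: each extra unit of tenure lowers the odds
```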
In [ ]:
# - predicting the probabilities
probabilities = binomial_linear_model.predict()
probabilities[:10]
Out[ ]:
array([0.31861382, 0.13162129, 0.47723817, 0.04417324, 0.60445587,
       0.72962176, 0.47459352, 0.2095354 , 0.53216748, 0.02770232])
In [ ]:
# - predicting binary labels, taking \sigma = 0.5
predictions = (probabilities > .5).astype('int')
predictions[:10]
Out[ ]:
array([0, 0, 0, 0, 1, 1, 0, 0, 1, 0])
In [ ]:
# - observed vs. predicted labels

predictions_df = pd.DataFrame()

predictions_df['observation'] = model_frame['Churn']
predictions_df['prediction'] = predictions

predictions_df.head()
Out[ ]:
observation prediction
0 0 0
1 0 0
2 1 0
3 0 0
4 1 1
In [ ]:
# - accuracy of the model
accuracy = predictions_df['observation'] == predictions_df['prediction']
accuracy = np.sum(accuracy)/len(accuracy)
np.round(accuracy, 4)
Out[ ]:
0.785
In [ ]:
### --- Model diagnostics

# - Log-likelihood of model
model_loglike = binomial_linear_model.llf
model_loglike
Out[ ]:
-3188.1130873174216
In [ ]:
# - model deviance
residuals_deviance = binomial_linear_model.resid_dev
model_deviance = np.sum(residuals_deviance**2)
model_deviance
Out[ ]:
6376.226174634843
In [ ]:
# - another way to compute model deviance
np.sum(residuals_deviance**2) == -2*model_loglike
Out[ ]:
True

The Akaike Information Criterion (AIC)¶

The Akaike Information Criterion (AIC) is a statistical measure used to evaluate the goodness-of-fit of a model. It is based on the principle of parsimony, which states that simpler models should be preferred over more complex ones, all else being equal.

The AIC is defined as follows:

$$AIC = -2\ln(\mathcal{L}) + 2k $$

where $\mathcal{L}$ is the model likelihood and $k$ is the number of parameters in the model.

The AIC penalizes models with more parameters, by adding a penalty term $2k$ to the log-likelihood $-2\ln(\mathcal{L})$. This penalty term is larger for models with more parameters, and hence it discourages overfitting and encourages simpler models.

The AIC can be used to compare different models and select the best one based on their AIC values. The model with the lowest AIC value is preferred, as it strikes a good balance between goodness-of-fit and simplicity.
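As a toy illustration of that trade-off (the log-likelihood values below are made up, not from our models): a richer model with a slightly better fit can still lose on AIC once the penalty term is added.

```python
import numpy as np

def aic(loglike, k):
    # - k counts all estimated parameters, including the intercept
    return -2 * loglike + 2 * k

# - hypothetical numbers, for illustration only
small = aic(-3188.1, 4)    # 3 predictors + intercept
rich = aic(-3186.9, 11)    # better fit, but many more parameters
print(small, rich)         # the simpler model has the lower AIC and is preferred
```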

In [ ]:
# - Akaike Information Criterion (AIC)
binomial_linear_model.aic
Out[ ]:
6384.226174634843
In [ ]:
# - another way to compute AIC
aic = -2*model_loglike + 2*(len(predictors) + 1)  # - the intercept counts as a parameter too
aic
Out[ ]:
6384.226174634843

Model effect: comparison to the Null Model¶

In [ ]:
# - Log-likelihood of model
model_loglike = binomial_linear_model.llf
model_loglike
Out[ ]:
-3188.1130873174216
In [ ]:
# Value of the constant-only loglikelihood
null_loglike = binomial_linear_model.llnull
null_loglike
Out[ ]:
-4071.6775733255813
In [ ]:
# - comparison to the Null Model: the deviance difference follows a Chi-Square distribution

# - difference between the deviances of the Null Model and our model:
# - likelihood ratio chi-squared statistic; -2*(llnull - llf)
dev_diff = binomial_linear_model.llr
dev_diff
Out[ ]:
1767.1289720163195
In [ ]:
-2*(null_loglike - model_loglike)
Out[ ]:
1767.1289720163195

Target: Predict churn from all the predictors¶

In [ ]:
# - exponential of the parameters and AIC of the model using only numerical predictors (a reminder)
np.exp(binomial_linear_model.params)
Out[ ]:
Intercept         0.202133
tenure            0.935088
MonthlyCharges    1.030660
TotalCharges      1.000145
dtype: float64
In [ ]:
binomial_linear_model.aic
Out[ ]:
6384.226174634843
In [ ]:
### --- Preparing the dataset

# - dropping the 'customerID' column
model_frame = churn_data.drop(columns='customerID')
model_frame.head()
Out[ ]:
tenure PhoneService Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 1 No Month-to-month Yes Electronic check 29.85 29.85 No
1 34 Yes One year No Mailed check 56.95 1889.50 No
2 2 Yes Month-to-month Yes Mailed check 53.85 108.15 Yes
3 45 No One year No Bank transfer (automatic) 42.30 1840.75 No
4 2 Yes Month-to-month Yes Electronic check 70.70 151.65 Yes
In [ ]:
model_frame.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7032 entries, 0 to 7031
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   tenure            7032 non-null   int64  
 1   PhoneService      7032 non-null   object 
 2   Contract          7032 non-null   object 
 3   PaperlessBilling  7032 non-null   object 
 4   PaymentMethod     7032 non-null   object 
 5   MonthlyCharges    7032 non-null   float64
 6   TotalCharges      7032 non-null   float64
 7   Churn             7032 non-null   object 
dtypes: float64(2), int64(1), object(5)
memory usage: 439.6+ KB
In [ ]:
# - encoding 'Churn' column to binary values
model_frame['Churn'] = model_frame['Churn'].apply(lambda x: int(x == 'Yes'))
model_frame.head()
Out[ ]:
tenure PhoneService Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 1 No Month-to-month Yes Electronic check 29.85 29.85 0
1 34 Yes One year No Mailed check 56.95 1889.50 0
2 2 Yes Month-to-month Yes Mailed check 53.85 108.15 1
3 45 No One year No Bank transfer (automatic) 42.30 1840.75 0
4 2 Yes Month-to-month Yes Electronic check 70.70 151.65 1
In [ ]:
predictors = model_frame.columns.drop('Churn')
predictors
Out[ ]:
Index(['tenure', 'PhoneService', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges'],
      dtype='object')
In [ ]:
# --- Composing the formula of the model

# - right side of the formula
formula = ' + '.join(predictors)

# - left side of the formula
formula = 'Churn ~ ' + formula

formula
Out[ ]:
'Churn ~ tenure + PhoneService + Contract + PaperlessBilling + PaymentMethod + MonthlyCharges + TotalCharges'
In [ ]:
# - fitting BLR model to the data
binomial_linear_model = smf.logit(formula=formula, data=model_frame).fit()
Optimization terminated successfully.
         Current function value: 0.424667
         Iterations 8
In [ ]:
binomial_linear_model.summary()
Out[ ]:
Logit Regression Results
Dep. Variable: Churn No. Observations: 7032
Model: Logit Df Residuals: 7021
Method: MLE Df Model: 10
Date: Wed, 24 May 2023 Pseudo R-squ.: 0.2666
Time: 17:51:11 Log-Likelihood: -2986.3
converged: True LL-Null: -4071.7
Covariance Type: nonrobust LLR p-value: 0.000
coef std err z P>|z| [0.025 0.975]
Intercept -0.8446 0.176 -4.794 0.000 -1.190 -0.499
PhoneService[T.Yes] -0.8419 0.122 -6.926 0.000 -1.080 -0.604
Contract[T.One year] -0.9171 0.103 -8.876 0.000 -1.120 -0.715
Contract[T.Two year] -1.8135 0.172 -10.565 0.000 -2.150 -1.477
PaperlessBilling[T.Yes] 0.4230 0.073 5.812 0.000 0.280 0.566
PaymentMethod[T.Credit card (automatic)] -0.1006 0.112 -0.895 0.371 -0.321 0.120
PaymentMethod[T.Electronic check] 0.4126 0.093 4.453 0.000 0.231 0.594
PaymentMethod[T.Mailed check] -0.1194 0.113 -1.061 0.289 -0.340 0.101
tenure -0.0606 0.006 -9.932 0.000 -0.073 -0.049
MonthlyCharges 0.0225 0.002 11.022 0.000 0.019 0.027
TotalCharges 0.0003 6.81e-05 4.205 0.000 0.000 0.000
In [ ]:
# - exponentials of the new model parameters
np.exp(binomial_linear_model.params)
Out[ ]:
Intercept                                   0.429725
PhoneService[T.Yes]                         0.430907
Contract[T.One year]                        0.399695
Contract[T.Two year]                        0.163084
PaperlessBilling[T.Yes]                     1.526520
PaymentMethod[T.Credit card (automatic)]    0.904265
PaymentMethod[T.Electronic check]           1.510691
PaymentMethod[T.Mailed check]               0.887475
tenure                                      0.941170
MonthlyCharges                              1.022802
TotalCharges                                1.000286
dtype: float64
In [ ]:
# - AIC of the new model
binomial_linear_model.aic
Out[ ]:
5994.512151127206

2. BLR using scikit-learn¶

Target: Predicting churn from numerical predictors¶

In [ ]:
# - import scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
In [ ]:
### --- Preparing the variables 

# - feature matrix
X = churn_data[['tenure', 'MonthlyCharges', 'TotalCharges']].values

# - target variable
y = churn_data['Churn'].apply(lambda x: int(x == 'Yes'))
In [ ]:
## --- Fitting the logistic model to the numerical data
log_reg = LogisticRegression()
log_reg.fit(X, y)
Out[ ]:
LogisticRegression()
In [ ]:
# - coefficients of the model
log_reg.coef_, log_reg.intercept_
Out[ ]:
(array([[-0.06711264,  0.03019993,  0.00014507]]), array([-1.59884834]))
In [ ]:
# - exponentials of the model coefficients
np.exp(log_reg.coef_), np.exp(log_reg.intercept_)
Out[ ]:
(array([[0.93508987, 1.03066058, 1.00014508]]), array([0.20212917]))
In [ ]:
# - model's accuracy
round(log_reg.score(X, y), 4)
Out[ ]:
0.785
In [ ]:
# - confusion matrix for the given data
y_pred = log_reg.predict(X)
confusion_matrix(y, y_pred)
Out[ ]:
array([[4693,  470],
       [1042,  827]])
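As a sanity check, the accuracy reported above can be recovered from this confusion matrix: the correct predictions sit on the diagonal (true negatives and true positives).

```python
import numpy as np

cm = np.array([[4693,  470],
               [1042,  827]])      # the confusion matrix from above

accuracy = np.trace(cm) / cm.sum()  # (correct predictions) / (all predictions)
print(round(accuracy, 4))           # 0.785 - matches log_reg.score(X, y)
```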

Target: Predicting churn from all the predictors¶

In [ ]:
churn_data.head()
Out[ ]:
customerID tenure PhoneService Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG 1 No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE 34 Yes One year No Mailed check 56.95 1889.50 No
2 3668-QPYBK 2 Yes Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW 45 No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU 2 Yes Month-to-month Yes Electronic check 70.70 151.65 Yes
In [ ]:
### --- Composing the feature matrix

# - dropping all the non-numerical and non-binary categorical columns
X0 = churn_data.drop(columns=['customerID', 'Contract', 'PaymentMethod', 'Churn'])

# - encoding binary categorical features to binary values
X0['PaperlessBilling'] = X0['PaperlessBilling'].apply(lambda x: int(x == 'Yes'))
X0['PhoneService'] = X0['PhoneService'].apply(lambda x: int(x == 'Yes'))

X0.head()
Out[ ]:
tenure PhoneService PaperlessBilling MonthlyCharges TotalCharges
0 1 0 1 29.85 29.85
1 34 1 0 56.95 1889.50
2 2 1 1 53.85 108.15
3 45 0 0 42.30 1840.75
4 2 1 1 70.70 151.65
In [ ]:
# - casting the data frame into a matrix
X0 = X0.values
X0
Out[ ]:
array([[1.0000e+00, 0.0000e+00, 1.0000e+00, 2.9850e+01, 2.9850e+01],
       [3.4000e+01, 1.0000e+00, 0.0000e+00, 5.6950e+01, 1.8895e+03],
       [2.0000e+00, 1.0000e+00, 1.0000e+00, 5.3850e+01, 1.0815e+02],
       ...,
       [1.1000e+01, 0.0000e+00, 1.0000e+00, 2.9600e+01, 3.4645e+02],
       [4.0000e+00, 1.0000e+00, 1.0000e+00, 7.4400e+01, 3.0660e+02],
       [6.6000e+01, 1.0000e+00, 1.0000e+00, 1.0565e+02, 6.8445e+03]])
In [ ]:
# - categories of the 'Contract' variable
churn_data['Contract'].unique()
Out[ ]:
array(['Month-to-month', 'One year', 'Two year'], dtype=object)
In [ ]:
# - categories of the 'PaymentMethod' variable
churn_data['PaymentMethod'].unique()
Out[ ]:
array(['Electronic check', 'Mailed check', 'Bank transfer (automatic)',
       'Credit card (automatic)'], dtype=object)
In [ ]:
# - we want to recreate the previous statsmodels model that was using all the predictors
# - to achieve this we one-hot (dummy) encode the non-binary categorical predictors
# - statsmodels takes the first category in sorted (alphabetical) order as the reference category
# - we pass the reference category manually as an argument to the OneHotEncoder
# - N.B. in scikit-learn >= 1.2 the argument is named sparse_output instead of sparse

enc_contract = OneHotEncoder(drop=['Month-to-month'], sparse=False)
dummy_contract = enc_contract.fit_transform(churn_data['Contract'].values.reshape(-1, 1))

enc_payment = OneHotEncoder(drop=['Bank transfer (automatic)'], sparse=False)
dummy_payment = enc_payment.fit_transform(churn_data['PaymentMethod'].values.reshape(-1, 1))
In [ ]:
# - concatenating values of the numerical predictors and encoded binary values with the encoded non-binary values
# - into a feature matrix
X = np.concatenate((X0, dummy_contract, dummy_payment), axis=-1)
display(X)

# - target variable; encoding to binary values
y = churn_data['Churn'].apply(lambda x: int(x == 'Yes'))
array([[ 1.,  0.,  1., ...,  0.,  1.,  0.],
       [34.,  1.,  0., ...,  0.,  0.,  1.],
       [ 2.,  1.,  1., ...,  0.,  0.,  1.],
       ...,
       [11.,  0.,  1., ...,  0.,  1.,  0.],
       [ 4.,  1.,  1., ...,  0.,  0.,  1.],
       [66.,  1.,  1., ...,  0.,  0.,  0.]])
In [ ]:
### --- Fitting the logistic model to all the data
# - N.B. in scikit-learn >= 1.2 use penalty=None instead of penalty='none'
log_reg = LogisticRegression(solver='newton-cg', penalty='none')
log_reg.fit(X, y)
Out[ ]:
LogisticRegression(penalty='none', solver='newton-cg')
In [ ]:
# - model's accuracy
round(log_reg.score(X, y), 4)
Out[ ]:
0.7964
In [ ]:
# - exponential of the model parameters
# - ordering corresponds to the ordering of the features in the feature matrix
np.exp(log_reg.coef_), np.exp(log_reg.intercept_)
Out[ ]:
(array([[0.9411695 , 0.43090694, 1.52652001, 1.02280179, 1.00028639,
         0.39969523, 0.16308373, 0.90426487, 1.51069102, 0.88747499]]),
 array([0.42972514]))
In [ ]:
# - confusion matrix for the given data
y_pred = log_reg.predict(X)
confusion_matrix(y, y_pred)
Out[ ]:
array([[4615,  548],
       [ 884,  985]])

3. MLE for Binomial Logistic Regression¶

Say we have observed the following data: $HHTHTTHHHT$. Assume that we know the parameter $p_H$. We can then compute the likelihood $\mathcal{L}(p_H|HHTHTTHHHT)$ exactly as we did before. Now, this is the general form of the Binomial Likelihood (where $Y$ stands for the observed data):

$$\mathcal{L}(p|Y) = p_1^y(1-p_1)^{n-y}$$

where $y$ is the number of successes and $n$ the total number of observations. For each observed data point then we have

$$\mathcal{L}(p|y_i) = p_1^{y_i}(1-p_1)^{\bar{y_i}}$$

where ${y_i}$ is the observed value of the outcome $Y$, and $\bar{y_i}$ is its complement (i.e. $1$ for $0$ and $0$ for $1$). This form just determines which value enters the computation of the likelihood function at each observed data point: either $p_1$ or $1-p_1$. The likelihood for a given value of $p_1$ over the whole dataset is then computed by multiplying the values of $\mathcal{L}(p|y_i)$ across all data points (remember that multiplication plays the role in Probability that conjunction plays in Logic and Algebra).
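As a sketch, the likelihood of the coin data $HHTHTTHHHT$ (6 heads, 4 tails) for a candidate value of $p_H$ is just this product of per-observation terms:

```python
import numpy as np

data = np.array([1, 1, 0, 1, 0, 0, 1, 1, 1, 0])  # HHTHTTHHHT, with H coded as 1

def likelihood(p, y):
    # - multiply p for each success and (1 - p) for each failure
    return np.prod(np.where(y == 1, p, 1 - p))

print(likelihood(0.5, data))  # 0.5**10
print(likelihood(0.6, data))  # higher: p = 0.6 matches the observed 6/10 heads better
```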

Q: But... how do we get to $p_1$, the parameter value that we will use at each data point? A: We will search the parameter space, of course: the space of linear coefficients $\beta_0, \beta_1, ... \beta_k$ in our Binomial Logistic Model, computing $p_1$ every time and the likelihood function from it! In other words: we will search the parameter space to find the combination of $\beta_0, \beta_1, ... \beta_k$ that produces the maximum of the likelihood function, just as we searched the space of linear coefficients to find the combination that minimizes the squared error in Simple Linear Regression.

So what combination of the linear coefficients is the best one?

It is the one which gives the Maximum Likelihood. This approach, known as Maximum Likelihood Estimation (MLE), stands behind many important statistical learning models and is a cornerstone of statistical estimation theory. It is contrasted with the Least Squares Estimation that we used earlier to estimate the Simple and Multiple Linear Regression models.

Now, there is a technical problem with this approach. To obtain the likelihood of the whole dataset one needs to multiply as many very small numbers as there are data points. The product quickly underflows: it drops below the smallest positive real number that a digital computer can represent. The workaround is to use the logarithm of the likelihood instead, known as the Log-Likelihood ($LL$).

Thus, while the Likelihood function for the whole dataset would be

$$\mathcal{L}(p_1|Y) = \prod_{i=1}^{n}p_1^{y_i}(1-p_1)^{\bar{y}_i}$$

the Log-Likelihood function would be:

$$LL(p_1|Y) = \sum_{i=1}^{n} y_i\log(p_1)+\bar{y}_i\log(1-p_1)$$

And finally here is how we solve the Binomial Logistic Regression problem:

  • search through the parameter space spanned by the linear coefficients $\beta_0, \beta_1, ... \beta_k$,
  • predict $p_1$ from the model and a particular combination of the parameters,
  • compute the value of the Likelihood function for the whole dataset,
  • find the combination that yields the maximum of the Likelihood function.

Technically, in optimization we do not go for the maximum of the Likelihood function directly: we work with $LL$ instead of $\mathcal{L}(p|Y)$, and since numerical optimizers conventionally minimize, the solution is to minimize the negative $LL$, often written simply as $NLL$, the Negative Log-Likelihood function.
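The equivalence is easy to demonstrate on the coin example: minimizing the Bernoulli $NLL$ over $p$ recovers the MLE, which for a coin is simply the sample proportion of heads. A minimal sketch (the 0/1 encoding of the sequence is our own):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# - HHTHTTHHHT encoded as 1 = H, 0 = T: six heads out of ten
obs = np.array([1, 1, 0, 1, 0, 0, 1, 1, 1, 0])

def nll(p, y):
    # - negative log-likelihood of a Bernoulli sample for parameter p
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# - minimize the NLL over p in (0, 1); the MLE should be the
# - proportion of heads, i.e. 0.6
result = minimize_scalar(nll, bounds=(1e-6, 1 - 1e-6),
                         args=(obs,), method='bounded')
round(result.x, 4)
```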

In [ ]:
model_frame = churn_data[['Churn', 'MonthlyCharges', 'TotalCharges']].copy()
# - encoding 'Churn' values to binary values
model_frame['Churn'] = model_frame['Churn'].apply(lambda x: int(x == 'Yes'))
model_frame
Out[ ]:
Churn MonthlyCharges TotalCharges
0 0 29.85 29.85
1 0 56.95 1889.50
2 1 53.85 108.15
3 0 42.30 1840.75
4 1 70.70 151.65
... ... ... ...
7027 0 84.80 1990.50
7028 0 103.20 7362.90
7029 0 29.60 346.45
7030 1 74.40 306.60
7031 0 105.65 6844.50

7032 rows × 3 columns

Implement the model's predictive pass given the parameters; the following blr_predict() function implements the expression:

$$P(Y) = p_1 = \frac{1}{1+e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_kX_k)}}$$
In [ ]:
def blr_predict(params, data):
    # - grab parameters
    beta_0 = params[0]
    beta_1 = params[1]
    beta_2 = params[2]
    # - predict: model function
    x1 = data["MonthlyCharges"]
    x2 = data["TotalCharges"]
    # - linear combination:
    lc = beta_0 + beta_1*x1 + beta_2*x2
    ep = np.exp(-lc)
    p = 1/(1+ep)
    return(p) 

Test blr_predict()

In [ ]:
test_params = np.array([-2.7989, .0452, -.0006])
predictions = blr_predict(params=test_params, data=model_frame)
predictions
Out[ ]:
0       0.187309
1       0.204491
2       0.394181
3       0.120110
4       0.575848
          ...   
7027    0.460025
7028    0.072292
7029    0.158578
7030    0.593878
7031    0.106194
Length: 7032, dtype: float64

Now define the Negative Log-Likelihood function:

In [ ]:
from scipy.stats import binom

def blr_nll(params, data):
    # - predictive pass
    p = blr_predict(params, data)
    # - joint negative log-likelihood
    # - across all observations
    nll = -binom.logpmf(data["Churn"], 1, p).sum()
    return(nll)

Test blr_nll():

In [ ]:
# - test blr_nll()
test_params = np.array([-2.7989, .0452, -.0006])
blr_nll(params=test_params, data=model_frame)
Out[ ]:
3284.57719221028

Optimize!

In [ ]:
from scipy.optimize import minimize

# - initial (random) parameter values, packed into a 1-D array
# - (minimize expects x0 with x0.ndim == 1)
init_pars = np.array([
    np.random.uniform(low=-3, high=0),
    np.random.uniform(low=-.05, high=.05),
    np.random.uniform(low=-.001, high=.001)
])

# - optimize w. Nelder-Mead
optimal_model = minimize(
    # - fun(parameters, args)
    fun=blr_nll,
    args = model_frame, 
    x0 = init_pars, 
    method='Nelder-Mead',
    options={'maxiter':1e6, 
            'maxfev':1e6,
            'fatol':1e-6})

# - optimal parameters
for param in optimal_model.x:
    print(param)
-2.7989029105276146
0.04523074574378202
-0.0006181439990904628
In [ ]:
optimal_model.fun
Out[ ]:
3283.393915947146

Check against statsmodels

In [ ]:
predictors = model_frame.columns.drop('Churn')
formula = ' + '.join(predictors)
formula = 'Churn ~ ' + formula
print(formula)
binomial_linear_model = smf.logit(formula=formula, data=model_frame).fit()
binomial_linear_model.summary()
Churn ~ MonthlyCharges + TotalCharges
Optimization terminated successfully.
         Current function value: 0.466922
         Iterations 6
Out[ ]:
Logit Regression Results
Dep. Variable: Churn No. Observations: 7032
Model: Logit Df Residuals: 7029
Method: MLE Df Model: 2
Date: Wed, 24 May 2023 Pseudo R-squ.: 0.1936
Time: 17:51:15 Log-Likelihood: -3283.4
converged: True LL-Null: -4071.7
Covariance Type: nonrobust LLR p-value: 0.000
coef std err z P>|z| [0.025 0.975]
Intercept -2.7989 0.089 -31.548 0.000 -2.973 -2.625
MonthlyCharges 0.0452 0.001 31.620 0.000 0.042 0.048
TotalCharges -0.0006 1.97e-05 -31.373 0.000 -0.001 -0.001

Plot the Negative Log-Likelihood Function

In [ ]:
# - from statsmodels: beta_0 is -2.7989, beta_1 is .0452, and beta_2 is -.0006
beta_0 = -2.7989
beta_1_vals = np.linspace(-.05,.05,100)
beta_2_vals = np.linspace(-.001,.001,100)
grid = np.array([(beta_1, beta_2) for beta_1 in beta_1_vals for beta_2 in beta_2_vals])
grid = pd.DataFrame(grid)
grid = grid.rename(columns={0: "beta_1", 1: "beta_2"})
nll = []
for i in range(grid.shape[0]):
    pars = [beta_0, grid['beta_1'][i], grid['beta_2'][i]]
    nll.append(blr_nll(pars, model_frame))
grid['nll'] = nll
grid.sort_values('nll', ascending=False, inplace=True)
grid.head(100)
Out[ ]:
beta_1 beta_2 nll
9999 0.050000 0.001000 17814.865806
9998 0.050000 0.000980 17570.648096
9899 0.048990 0.001000 17562.334042
9997 0.050000 0.000960 17326.772023
9898 0.048990 0.000980 17318.646658
... ... ... ...
9492 0.044949 0.000859 14892.359746
101 -0.048990 -0.000980 14878.889700
4 -0.050000 -0.000919 14845.861773
102 -0.048990 -0.000960 14821.273142
200 -0.047980 -0.001000 14796.700924

100 rows × 3 columns

In [ ]:
# - import plotly
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook"

# - Mesh3d: Objective Function
fig = go.Figure(data=[go.Mesh3d(
    x=grid['beta_1'], 
    y=grid['beta_2'], 
    z=grid['nll'], 
    color='red', 
    opacity=0.50)])
fig.update_layout(scene = dict(
                    xaxis_title='Beta_1',
                    yaxis_title='Beta_2',
                    zaxis_title='NLL'),
                    width=700,
                    margin=dict(r=20, b=10, l=10, t=10))
fig.show()

4. Multinomial Regression¶

The Multinomial Regression model is a powerful classification tool. Consider a problem where some outcome variable can result in more than two discrete outcomes. For example, a customer visiting a webshop can end their visit by (a) buying nothing, (b) buying some Product A, (c) Product B, or (d) Product C, etc. If we have some information about a particular customer's journey through the website (e.g. how much time they spent on particular pages, whether they have visited the webshop before, or any other information that customers might have chosen to disclose at sign-up...), we can use it as a set of predictors of customer behavior resulting in any of (a), (b), (c), (d). We do that by means of a simple extension of the Binomial Logistic Regression that is used to solve for dichotomies: enter the Multinomial Regression model.

The Model¶

First, similar to what happens in dummy coding, given a set of $K$ possible outcomes we choose one of them as a baseline. All results of the Multinomial Regression are then interpreted as effects relative to that baseline outcome category, for example: for a unit increase in predictor $X_1$, what is the change in the odds of switching from (a) buying nothing to (b) buying Product A? We are already familiar with this logic, right?

So, consider a set of $K-1$ independent Binary Logistic Models with only one predictor $X$ where the baseline is now referred to as $K$:

$$\log\frac{P(Y_i = 1)}{P(Y_i = K)} = \beta_{1,0} + \beta_{1,1}X$$

$$\log\frac{P(Y_i = 2)}{P(Y_i = K)} = \beta_{2,0} + \beta_{2,1}X$$

$$\vdots$$

$$\log\frac{P(Y_i = K-1)}{P(Y_i = K)} = \beta_{K-1,0} + \beta_{K-1,1}X$$

Obviously, we are introducing a set of new regression coefficients ($\beta_{k,\cdot}$) for each possible value of the outcome $k = 1, 2,.., K-1$. The log-odds are on the LHS while the linear model remains on the RHS.

Now we exponentiate the equations to arrive at the expressions for odds:

$$\frac{P(Y_i = 1)}{P(Y_i = K)} = e^{\beta_{1,0} + \beta_{1,1}X}$$

$$\frac{P(Y_i = 2)}{P(Y_i = K)} = e^{\beta_{2,0} + \beta_{2,1}X}$$

$$\vdots$$

$$\frac{P(Y_i = K-1)}{P(Y_i = K)} = e^{\beta_{K-1,0} + \beta_{K-1,1}X}$$

And solve for $P(Y_i = 1), P(Y_i = 2),.. P(Y_i = K-1)$:

$$P(Y_i = 1) = P(Y_i = K)e^{\beta_{1,0} + \beta_{1,1}X}$$

$$P(Y_i = 2) = P(Y_i = K)e^{\beta_{2,0} + \beta_{2,1}X}$$

$$\vdots$$

$$P(Y_i = K-1) = P(Y_i = K)e^{\beta_{K-1,0} + \beta_{K-1,1}X}$$

From the fact that all probabilities $P(Y_i = 1), P(Y_i = 2), .., P(Y_i = K)$ must sum to one it can be shown that

$$P(Y_i = K) = \frac{1}{1+\sum_{k=1}^{K-1}e^{\beta_{k,0} + \beta_{k,1}X}}$$

Because:

(1) $P(Y_i = 1) + P(Y_i = 2) + ... + P(Y_i = K) = 1$, we have

(2) $P(Y_i = K) + P(Y_i = K)e^{\beta_{1,0} + \beta_{1,1}X} + P(Y_i = K)e^{\beta_{2,0} + \beta_{2,1}X} + ... + P(Y_i = K)e^{\beta_{K-1,0} + \beta_{K-1,1}X} = 1$

(3) and then, writing $l_k$ for the linear predictor $\beta_{k,0} + \beta_{k,1}X$, so that each exponential becomes $e^{l_k}$, we have

(4) $P(Y_i = K) + P(Y_i = K)e^{l_1} + P(Y_i = K)e^{l_2} + ... + P(Y_i = K)e^{l_{K-1}} = 1$

(5) $P(Y_i = K)[1 + e^{l_1} + e^{l_2} + ... + e^{l_{K-1}}] = 1$ so that

(6) $P(Y_i = K) = \frac{1}{1 + e^{l_1} + e^{l_2} + ... + e^{l_{K-1}}} = \frac{1}{1+\sum_{k=1}^{K-1}e^{\beta_{k,0} + \beta_{k,1}X}}$

It is easy now to derive the expressions for all $K-1$ probabilities of the outcome resulting in a particular class:

$$P(Y_i = 1) = \frac{e^{\beta_{1,0} + \beta_{1,1}X}}{1+\sum_{k=1}^{K-1}e^{\beta_{k,0} + \beta_{k,1}X}}$$

$$P(Y_i = 2) = \frac{e^{\beta_{2,0} + \beta_{2,1}X}}{1+\sum_{k=1}^{K-1}e^{\beta_{k,0} + \beta_{k,1}X}}$$

$$\vdots$$

$$P(Y_i = K-1) = \frac{e^{\beta_{K-1,0} + \beta_{K-1,1}X}}{1+\sum_{k=1}^{K-1}e^{\beta_{k,0} + \beta_{k,1}X}}$$
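These expressions translate directly into code: given the linear predictors for the $K-1$ non-baseline categories, all class probabilities follow from the shared denominator. A minimal sketch with made-up coefficients (the $\beta$ values below are purely illustrative):

```python
import numpy as np

# - hypothetical coefficients (beta_k0, beta_k1) for K - 1 = 2
# - non-baseline categories, one predictor X
betas = np.array([[0.5, 1.2],
                  [-0.3, 0.8]])
x = 1.5  # a single predictor value

# - linear predictors for the non-baseline categories
lin = betas[:, 0] + betas[:, 1] * x
# - shared denominator: 1 + sum of exponentiated linear predictors
denom = 1 + np.exp(lin).sum()

p_non_baseline = np.exp(lin) / denom  # P(Y = 1), ..., P(Y = K - 1)
p_baseline = 1 / denom                # P(Y = K)

# - all K probabilities sum to one
p_non_baseline.sum() + p_baseline
```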
In [ ]:
# - loading the dataset
# - Kaggle: https://www.kaggle.com/datasets/uciml/iris
# - place it in your _data/ directory
iris_data = pd.read_csv(os.path.join(data_dir, 'Iris.csv'), index_col='Id')
iris_data.head(10)
Out[ ]:
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
Id
1 5.1 3.5 1.4 0.2 Iris-setosa
2 4.9 3.0 1.4 0.2 Iris-setosa
3 4.7 3.2 1.3 0.2 Iris-setosa
4 4.6 3.1 1.5 0.2 Iris-setosa
5 5.0 3.6 1.4 0.2 Iris-setosa
6 5.4 3.9 1.7 0.4 Iris-setosa
7 4.6 3.4 1.4 0.3 Iris-setosa
8 5.0 3.4 1.5 0.2 Iris-setosa
9 4.4 2.9 1.4 0.2 Iris-setosa
10 4.9 3.1 1.5 0.1 Iris-setosa
In [ ]:
# - counting the instances of each category
iris_data['Species'].value_counts()
Out[ ]:
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: Species, dtype: int64
In [ ]:
# - info on the variables
iris_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 1 to 150
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   SepalLengthCm  150 non-null    float64
 1   SepalWidthCm   150 non-null    float64
 2   PetalLengthCm  150 non-null    float64
 3   PetalWidthCm   150 non-null    float64
 4   Species        150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 7.0+ KB

Target: predict species from all continuous predictors¶

We use the Multinomial Logistic Regression Model to predict the most probable category for a given observation $\textbf{x}$ with features $(x_1, x_2, \ldots, x_k)$. Assume that our target variable $y$ belongs to one of the categories from the set $\{1, 2, \ldots, C\}$. In MNR we usually select one category as the reference category; let that be category $C$. Then, the probability that the target variable $y$ belongs to category $c = 1,\ldots,C-1$ is calculated via

$$P(y = c) = \frac{e^{\beta^{(c)}_0 + \beta^{(c)}_1x_1 + \beta^{(c)}_2x_2 + \cdots + \beta^{(c)}_kx_k}}{1+\sum_{j=1}^{C-1}e^{\beta^{(j)}_0 + \beta^{(j)}_1x_1 + \beta^{(j)}_2x_2 + \cdots + \beta^{(j)}_kx_k}},$$

and the probability that it belongs to the reference category $C$ is

$$P(y = C) = \frac{1}{1+\sum_{j=1}^{C-1}e^{\beta^{(j)}_0 + \beta^{(j)}_1x_1 + \beta^{(j)}_2x_2 + \cdots + \beta^{(j)}_kx_k}},$$

where $\beta^{(j)}_1, \beta^{(j)}_2, \ldots, \beta^{(j)}_k,\ j=1,\ldots,C-1$ are the model's parameters for the predictors in each non-baseline category, and $\beta^{(j)}_0$ is the corresponding intercept.

After calculating all the probabilities $P(y = c),\ c=1,\ldots,C$ we predict the target variable as

$$\hat{y} = \textrm{argmax}_{c=1,\ldots,C}P(y=c).$$

The model is estimated by MLE (Maximum Likelihood Estimation). For each category $c$ - except for the reference $C$, of course - we obtain a set of coefficients. Each coefficient, in each category, tells us about the change in odds ($\Delta_{odds}$) in favor of the target category, relative to the baseline category, for a unit change in a predictor, everything else held constant.
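A minimal sketch of the decision rule and the odds interpretation described above; the probabilities and the coefficient value below are made up for illustration:

```python
import numpy as np

# - hypothetical predicted class probabilities
# - (rows = observations, columns = categories 0, ..., C - 1)
probs = np.array([[0.10, 0.20, 0.70],
                  [0.55, 0.30, 0.15]])

# - the argmax decision rule: pick the most probable category
y_hat = np.argmax(probs, axis=1)

# - interpreting a (hypothetical) coefficient beta = 0.4 for category c:
# - a unit increase in the predictor multiplies the odds of c vs. the
# - baseline by exp(beta), everything else held constant
odds_factor = np.exp(0.4)

print(y_hat, round(odds_factor, 4))
```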

In [ ]:
### --- Preparing the variables 

# - feature matrix
X = iris_data.drop(columns='Species')
# - we append a constant column of ones to the feature matrix
X = sm.add_constant(X)
print(X[:10])


# - we impose an ordering on the categories of the target vector; the first category is the reference category
cat_type = pd.CategoricalDtype(categories=["Iris-versicolor", "Iris-virginica", "Iris-setosa"], ordered=True)
y = iris_data['Species'].astype(cat_type)
    const  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
Id                                                                 
1     1.0            5.1           3.5            1.4           0.2
2     1.0            4.9           3.0            1.4           0.2
3     1.0            4.7           3.2            1.3           0.2
4     1.0            4.6           3.1            1.5           0.2
5     1.0            5.0           3.6            1.4           0.2
6     1.0            5.4           3.9            1.7           0.4
7     1.0            4.6           3.4            1.4           0.3
8     1.0            5.0           3.4            1.5           0.2
9     1.0            4.4           2.9            1.4           0.2
10    1.0            4.9           3.1            1.5           0.1
In [ ]:
# - fitting the MNR model to the data; we use the Newton conjugate gradient method ('ncg')
# - as the optimizer to compute the model's coefficients
mnr_model = sm.MNLogit(exog=X, endog=y).fit(method='ncg', maxiter=150)
mnr_model.summary()
Optimization terminated successfully.
         Current function value: 0.039663
         Iterations: 26
         Function evaluations: 27
         Gradient evaluations: 27
         Hessian evaluations: 26
Out[ ]:
MNLogit Regression Results
Dep. Variable: Species No. Observations: 150
Model: MNLogit Df Residuals: 140
Method: MLE Df Model: 8
Date: Wed, 24 May 2023 Pseudo R-squ.: 0.9639
Time: 17:51:25 Log-Likelihood: -5.9494
converged: True LL-Null: -164.79
Covariance Type: nonrobust LLR p-value: 7.056e-64
Species=Iris-virginica coef std err z P>|z| [0.025 0.975]
const -42.6241 25.698 -1.659 0.097 -92.992 7.744
SepalLengthCm -2.4650 2.394 -1.030 0.303 -7.158 2.227
SepalWidthCm -6.6810 4.479 -1.492 0.136 -15.460 2.098
PetalLengthCm 9.4273 4.735 1.991 0.046 0.146 18.708
PetalWidthCm 18.2837 9.741 1.877 0.061 -0.808 37.375
Species=Iris-setosa coef std err z P>|z| [0.025 0.975]
const 1.2685 3115.887 0.000 1.000 -6105.758 6108.295
SepalLengthCm 2.1956 882.047 0.002 0.998 -1726.584 1730.976
SepalWidthCm 6.4425 431.475 0.015 0.988 -839.233 852.118
PetalLengthCm -10.8411 560.407 -0.019 0.985 -1109.218 1087.536
PetalWidthCm -5.5832 1076.319 -0.005 0.996 -2115.129 2103.963
In [ ]:
# - confusion matrix for our model and the given data; rows/columns follow the ordering of the categorical variable
mnr_model.pred_table()
Out[ ]:
array([[49.,  1.,  0.],
       [ 1., 49.,  0.],
       [ 0.,  0., 50.]])
In [ ]:
# - accuracy of the model
correct_classes = np.trace(mnr_model.pred_table())
print(f'Correct observations: {correct_classes}')
num_obs = np.sum(mnr_model.pred_table())
print(f'Total observations: {num_obs}')
print(f'The accuracy of our model: {round(correct_classes/num_obs, 4)}')
Correct observations: 148.0
Total observations: 150.0
The accuracy of our model: 0.9867
In [ ]:
# - model's prediction of probabilities; columns correspond to the ordering of categorical variable
mnr_model.predict()[:10]
Out[ ]:
array([[7.41125993e-09, 1.15691965e-35, 9.99999993e-01],
       [2.88110802e-07, 2.07898383e-32, 9.99999712e-01],
       [4.16729326e-08, 5.04109728e-34, 9.99999958e-01],
       [8.64266383e-07, 1.71937382e-31, 9.99999136e-01],
       [4.84676016e-09, 4.96321518e-36, 9.99999995e-01],
       [2.30190564e-08, 7.76397915e-34, 9.99999977e-01],
       [7.39468847e-08, 4.80628656e-33, 9.99999926e-01],
       [5.19830060e-08, 5.19879646e-34, 9.99999948e-01],
       [1.64479719e-06, 7.94019900e-31, 9.99998355e-01],
       [2.55927514e-07, 3.90503168e-33, 9.99999744e-01]])
In [ ]:
# - model's prediction of categories; numbers correspond to the ordering of categorical variable
preds = np.argmax(mnr_model.predict(), axis=-1)
preds
Out[ ]:
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

Multicollinearity in Multinomial Regression¶

In [ ]:
# - Step 1: recode categorical target variable to integer values, 
# - just in order to be able to run a multiple linear regression on the data:
y_code = y.cat.codes
y_code
Out[ ]:
Id
1      2
2      2
3      2
4      2
5      2
      ..
146    1
147    1
148    1
149    1
150    1
Length: 150, dtype: int8
In [ ]:
### --- Step 2: produce a Multiple Linear Regression model for the data 
lin_model = sm.OLS(exog=X, endog=y_code).fit()
lin_model.summary()
Out[ ]:
OLS Regression Results
Dep. Variable: y R-squared: 0.571
Model: OLS Adj. R-squared: 0.559
Method: Least Squares F-statistic: 48.17
Date: Wed, 24 May 2023 Prob (F-statistic): 1.02e-25
Time: 17:51:26 Log-Likelihood: -119.03
No. Observations: 150 AIC: 248.1
Df Residuals: 145 BIC: 263.1
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -0.4405 0.509 -0.866 0.388 -1.446 0.565
SepalLengthCm 0.0872 0.143 0.608 0.544 -0.196 0.371
SepalWidthCm 0.6832 0.149 4.586 0.000 0.389 0.978
PetalLengthCm -0.4413 0.142 -3.117 0.002 -0.721 -0.161
PetalWidthCm 0.4198 0.235 1.789 0.076 -0.044 0.884
Omnibus: 28.857 Durbin-Watson: 0.448
Prob(Omnibus): 0.000 Jarque-Bera (JB): 7.904
Skew: -0.222 Prob(JB): 0.0192
Kurtosis: 1.966 Cond. No. 91.8


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [ ]:
# --- Step 3: compute VIFs for the predictors
predictors = iris_data.columns.drop('Species')
predictors
Out[ ]:
Index(['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm'], dtype='object')
In [ ]:
# - appending the columns of ones to the predictors' data
model_frame_predictors = sm.add_constant(iris_data[predictors])
In [ ]:
# - computing VIFs
vifs = [variance_inflation_factor(model_frame_predictors.values, i) for i in range(1, len(predictors)+1)]
vifs = np.array(vifs).reshape(1, -1)
pd.DataFrame(vifs, columns=predictors)
Out[ ]:
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
0 7.103113 2.099039 31.397292 16.141564

Multinomial Logistic Regression using scikit-learn¶

In [ ]:
# - import scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
In [ ]:
# - Preparing the variables 

# - feature matrix
X = iris_data.drop(columns='Species').values

# - target variable
y = iris_data['Species']

N.B. scikit-learn does not implement the reference category automatically!

In [ ]:
# - Fitting the logistic model to the numerical data
# - scikit-learn does not implement the reference category automatically
log_reg = LogisticRegression(solver='newton-cg', penalty='none')
log_reg.fit(X, y)
Out[ ]:
LogisticRegression(penalty='none', solver='newton-cg')
In [ ]:
# - coefficients of the model; rows correspond to the order of appearance of categories in the target variable
log_reg.coef_, log_reg.intercept_
Out[ ]:
(array([[  4.88575344,   7.84767306, -12.84148798,  -6.66362742],
        [ -1.21026772,  -0.58338902,   1.70604111,  -5.81126865],
        [ -3.67548572,  -7.26428404,  11.13544687,  12.47489607]]),
 array([  2.32919685,  20.15437297, -22.48356982]))
In [ ]:
# - model's accuracy
round(log_reg.score(X, y), 4)
Out[ ]:
0.9867
In [ ]:
# - predictions
y_pred = log_reg.predict(X)
y_pred[:10]
Out[ ]:
array(['Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa'], dtype=object)
In [ ]:
# - confusion matrix for the given data
# - rows/columns correspond to the order of appearance of categories in the target variable
confusion_matrix(y, y_pred)
Out[ ]:
array([[50,  0,  0],
       [ 0, 49,  1],
       [ 0,  1, 49]])

Further Reading¶

  • Logistic Regression — Gradient Descent Optimization — Part 1
  • Logistic regression with gradient descent —Tutorial Part 1 — Theory
  • Logistic regression with gradient descent — Tutorial Part 2— CODE
  • LINEAR SUPERVISED LEARNING SERIES: Part 2: Logistic regression

DataKolektiv, 2022/23.

hello@datakolektiv.com

License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.